Multiscale Discriminant Saliency for Visual Attention

Authors

  • Anh Cat Le Ngo
  • Li-Minn Ang
  • Guoping Qiu
  • Kah Phooi Seng
Abstract

Bottom-up saliency, an early stage of human visual attention, can be considered as a binary classification problem between center and surround classes. The discriminant power of features for this classification is measured as the mutual information between the features and the distribution of the two classes. The estimated discrepancy between the two feature classes depends strongly on the scale levels considered; therefore, multi-scale structure and discriminant power are integrated by employing discrete wavelet features and a hidden Markov tree (HMT). With wavelet coefficients and hidden Markov tree parameters, quad-tree-like label structures are constructed and used in maximum a posteriori (MAP) estimation of the hidden class variables at the corresponding dyadic sub-squares. Then, the saliency value for each dyadic square at each scale level is computed with the discriminant power principle and the MAP estimate. Finally, the final saliency map is integrated across multiple scales by an information maximization rule. Both standard quantitative tools such as NSS, LCC, and AUC and qualitative assessments are used to evaluate the proposed multiscale discriminant saliency method (MDIS) against the well-known information-based saliency method AIM on its Bruce database with eye-tracking data. Simulation results are presented and analyzed to verify the validity of MDIS and to point out its disadvantages as directions for further research.

1 Visual Attention Computational Approach

Visual attention is a psychological phenomenon in which human visual systems are optimized for capturing scenic information. The robustness and efficiency of the biological devices involved, namely the eyes, their control systems, and the visual pathways in the brain, have amazed scientists and engineers for centuries. From Neisser [26] to Marr [25], researchers have put intensive effort into discovering the principles of attention and engineering artificial systems with equivalent capability.
For the last two decades, this research field has been dominated by visual saliency principles, which propose the existence of a saliency map for attention guidance. The idea is further promoted in Feature Integration Theory (FIT) [34], which elaborates computational principles of saliency map generation with center-surround operators and basic image features such as intensity, orientation, and color. Then, Itti et al. [21] implemented and released the first complete computer algorithms of FIT (see footnote 5). Feature Integration Theory is widely accepted as the principle behind visual attention, partly because it uses basic image features. Moreover, this hypothesis is supported by evidence from several psychological experiments. However, it only defines the theoretical aspects of visual attention with saliency maps and does not investigate how such principles would be implemented algorithmically. This leaves the research field open for many later saliency algorithms [21],[16],[32],[19], etc. Saliency might be computed as a linear contrast between features of central and surrounding environments across multiple scales in the center-surround operation. Saliency is also modeled as a phase difference in the Fourier transform domain [20], or the saliency at each location is made to depend on statistical modeling of the local feature distribution [32]. Though many approaches are mentioned in the long and rich literature of visual saliency, only a few are built on a solid theory or linked to other well-established computational theories. Among these approaches, Neil Bruce’s work [6] established a bridge between visual saliency and information theory. It was a first step toward bridging two unrelated fields; moreover, visual attention could for the first time be modeled as an information system. Since then, information-based visual saliency has been continuously investigated and developed in several works [23],[22],[29],[14].
The distinguishing point between these works is the computational approach used to retrieve information from features. The process attracts much interest because of the difficulty of estimating the information of high-dimensional data such as 2-D image patches. It usually runs into computational problems which cannot be solved efficiently due to the curse of dimensionality; moreover, central and surrounding contexts are usually defined in an ad hoc manner without much theoretical support. To counter these difficulties, Dashan Gao et al. simplified the information extraction step into a binary classification problem using decision theory. Two classes are identified as the center and surround contexts; then the discriminant power, or mutual information between features and classes, is estimated as the saliency value for each location. This formulation of visual saliency is named Discriminant Saliency (DIS), and its underlying principles are carefully elaborated by Gao et al. [18]. Its significant point is that information is estimated from class distributions given the input features rather than from the input features themselves. Therefore, the computational load is significantly reduced, since only a simple class distribution needs to be estimated rather than a complex feature distribution.

Footnote 5: http://ilab.usc.edu/toolkit/

arXiv:1301.3964v1 [cs.CV] 17 Jan 2013

Spatial features have a dominant influence on saliency values; however, scale-space features also play a decisive role in visual saliency computation, since center and surround environments are simply processing windows of different sizes. In signal processing, scale space and spectral space are two sides of the same coin; therefore, there is a strong relation between scale, frequency, and saliency in the visual attention problem. Several researchers [3,30,27,33] outlined that fixated regions have high spatial contrast or showed that high-frequency edges allow stronger discrimination of fixated over non-fixated points.
In brief, they all come to one conclusion: predictability increases at high frequencies. Though these studies emphasize a greater visual attraction to high frequencies (edges, ridges, and other image structures), other works focus on medium frequencies. Bruce et al. [7] found that fixation points tend to prefer horizontal and vertical frequency content rather than random positions, and that this oriented content differs more noticeably at medium frequencies. More interestingly, the choice of frequency range for visual processing may depend on the visual context encountered [2]. For example, luminance contrast explained fixation locations better in the natural image category and slightly worse in the urban scene category, provided that all images were low-pass filtered as a preprocessing step. Perhaps the attention system includes different ranges of frequencies in generating optimal eye movements. Diversity in the use of spectral space means the use of several different scales in scale-space theory. It can be assumed that both high frequencies (small scales) and medium frequencies (medium scales) have ecological relevance and constitute a compromise between information requirements and available attentional capacity in the early stage of visual attention, when observers are not driven by any specific task.

Though the multi-scale nature has been emphasized as an implicit element of human visual attention, it is often ignored in visual saliency algorithms. For example, the DIS approach [18] considers only one fixed-size window, which may cause significant attentive features in a scene to be overlooked. Therefore, the DIS approach needs to be reformulated under a multi-scale framework to form the multiscale discriminant saliency (MDIS) approach. This is the main motivation as well as the contribution of this paper, which is organized as follows. Section 2 reviews the principles behind DIS [14] and focuses on its important assumptions and limitations.
After that, the MDIS approach is elaborated in section 3, covering multiple dyadic windows for the binary classification problem in subsection 3.1, the multiscale statistical model of wavelet coefficients in subsection 3.2, and the maximum likelihood (MLL) and maximum a posteriori (MAP) computation for dyadic sub-squares in subsections 3.3 and 3.4. Then, all steps of MDIS are combined for final saliency map generation in subsection 3.5. Quantitative and qualitative analyses of the proposed method under different simulation modes are discussed in section 4; moreover, simulation data of MDIS in comparison with the well-known information-based saliency method AIM [6] are presented together with the conclusions drawn from them. Finally, the main contributions of this paper and directions for further research are stated in the concluding section 5.

2 Visual Attention Discriminant Saliency

The saliency mechanism plays a key role in perceptual organization; therefore, several researchers have recently attempted to generalize principles for visual saliency. From the decision-theoretic point of view, saliency is regarded as the power to distinguish salient from non-salient classes; moreover, discriminant saliency combines the classical center-surround hypothesis with a derived optimal saliency architecture. In other words, the saliency of each image location is identified by the discriminant power of a feature set with respect to the binary classification problem between center and surround classes. Being based on decision theory, this discriminant saliency detector can work with a variety of stimulus modalities, including intensity, color, orientation, and motion. Moreover, various psychophysical properties for both static and motion stimuli are shown to be accurately satisfied, in quantitative terms, by DIS saliency maps. Perceptual systems evolve to produce optimal decisions about the state of the surrounding environment in a decision-theoretic sense, that is, with minimum probability of error.
Besides making accurate decisions, the perceptual mechanisms should be as efficient as possible. Mathematically, the problem needs to be defined as (1) a binary classification of stimuli of interest (salient features) against the null hypothesis (non-salient features) and (2) a measurement of the discriminant power of the extracted visual features as the saliency at each location in the visual field. The discriminant power is estimated in the classification process with respect to two classes of stimuli: stimuli of interest and null stimuli comprising all features of no interest. Each location of the visual field can be classified as to whether it includes stimuli of interest, optimally with the lowest expected probability of error. From a purely computational standpoint, binary classification with discriminant features is widely studied and well defined as a tractable problem in the literature. Moreover, the discriminant saliency concept and decision theory appear in both top-down and bottom-up problems with different specifications of the stimuli of interest [16],[14].

The early stages of biological vision are dominated by the ubiquity of the “center-surround” operator; therefore, bottom-up saliency is commonly defined as how certainly the stimuli at each location of the central visual field can be distinguished from the other stimuli in its surround. In other words, the “center-surround” hypothesis is a natural binary classification problem which can be solved by well-established decision theory. In this problem, the classes can be defined as follows.

  • Center class: observations within a central neighborhood $W_l^1$ of the visual field location $l$.
  • Surround class: observations within a surrounding window $W_l^0$ of the above central region.

At each location, the likelihood of either hypothesis depends on the visual stimulus, represented by a predefined set of features $X$. The saliency at location $l$ should be measured as the discriminating power of the features $X$ in $W_l^1$ against the features $X$ in $W_l^0$.
In other words, the discriminant saliency value is proportional to the distance between the feature distributions of the center and surround classes. Feature responses within the windows are drawn from the predefined feature set $X$. Since there are many possible combinations and orders in which such responses can be assembled, the observations of features can be considered as a random process $X(l) = (X_1(l), \ldots, X_d(l))$ of dimension $d$. This random process is drawn conditionally on the state of the hidden variable $Y(l)$, which is either the center or the surround state. Feature vectors $x(j)$ with $j \in W_l^c$, $c \in \{0, 1\}$, are drawn from class $c$ according to the conditional probability density $P_{X(l)|Y(l)}(x|c)$, where $Y(l) = 0$ for surround and $Y(l) = 1$ for center. The saliency of location $l$, $S(l)$, is equal to the discriminant power of $X$ for the classification of the observed feature vectors. This discriminant power is quantified by the mutual information between the features $X$ and the class label $Y$.
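The center-surround formulation above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single scalar feature per pixel (values in [0, 1]), square windows with hypothetical radii, and a plain histogram estimate of the discrete mutual information $I(X;Y) = \sum_{c}\sum_{x} p(x,c)\log_2 \frac{p(x,c)}{p(x)\,p(c)}$ in place of the wavelet and hidden Markov tree machinery the paper develops later. The function names `center_surround`, `mutual_information`, and `saliency_at` are introduced here for illustration only.

```python
import math


def window_values(image, row, col, radius):
    """Map each (r, c) inside the square window around (row, col) to its value."""
    rows, cols = len(image), len(image[0])
    return {
        (r, c): image[r][c]
        for r in range(max(0, row - radius), min(rows, row + radius + 1))
        for c in range(max(0, col - radius), min(cols, col + radius + 1))
    }


def center_surround(image, row, col, center_radius=1, surround_radius=3):
    """Split observations into center (Y = 1) and surround (Y = 0) samples.

    The radii are illustrative assumptions, not values from the paper.
    """
    center = window_values(image, row, col, center_radius)
    outer = window_values(image, row, col, surround_radius)
    surround = [v for pos, v in outer.items() if pos not in center]
    return list(center.values()), surround


def mutual_information(center, surround, bins=8, lo=0.0, hi=1.0):
    """Histogram estimate of I(X; Y) in bits for a scalar feature X."""

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
            counts[idx] += 1
        return counts

    joint = {1: histogram(center), 0: histogram(surround)}
    total = len(center) + len(surround)
    p_y = {1: len(center) / total, 0: len(surround) / total}
    mi = 0.0
    for b in range(bins):
        p_x = (joint[0][b] + joint[1][b]) / total
        for c in (0, 1):
            p_xy = joint[c][b] / total
            if p_xy > 0:
                mi += p_xy * math.log2(p_xy / (p_x * p_y[c]))
    return mi


def saliency_at(image, row, col):
    """Saliency S(l) as the estimated mutual information at location l."""
    center, surround = center_surround(image, row, col)
    return mutual_information(center, surround)


# Usage: a uniform image versus the same image with a bright 3x3 patch.
flat = [[0.2] * 9 for _ in range(9)]
patch = [row[:] for row in flat]
for r in range(3, 6):
    for c in range(3, 6):
        patch[r][c] = 0.9

print(saliency_at(flat, 4, 4))   # features carry no class information
print(saliency_at(patch, 4, 4))  # features separate center from surround
```

On the uniform image the feature distribution is identical in both windows, so the estimate is zero; on the image with the patch the feature value determines the class, so the estimate reaches the entropy of the class label. A histogram estimate like this one does not scale to high-dimensional features, which is the difficulty the text attributes to information-based approaches.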




Publication date: 2013